Correcting ‘Wrong-Column’ Errors in Text Databases

Authors

  • Caroline Sporleder
  • Marieke van Erp
Abstract

We present a novel data-driven approach for detecting and correcting errors in text databases. We focus on information that was accidentally entered in an incorrect column. Unlike machine-learning approaches to data cleaning that assume the database cells to contain atomic or numeric content, our method takes into account substrings of textual cells, and treats error detection and correction as a text categorisation task. Errors are detected at points where the classifier disagrees with the data; corrections are the suggestions put forward by the classifier. We demonstrate that the method is suited for high-recall detection of errors in freetext columns of a zoological database, with a high correction accuracy as well.
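The detection-by-disagreement idea in the abstract can be illustrated with a minimal sketch. It rests on several assumptions that are not taken from the paper: a small multinomial Naive Bayes classifier stands in for whatever learner the authors used, whitespace tokens stand in for the substrings of a cell, each cell is classified by a model trained on all other cells (leave-one-out), and the column names and example cells are invented for illustration.

```python
import math
from collections import Counter, defaultdict


def tokenize(text):
    # Assumption: lower-cased whitespace tokens approximate "substrings of textual cells".
    return text.lower().split()


def train(cells):
    """Count tokens and cells per column. `cells` is a list of (column, text) pairs."""
    token_counts = defaultdict(Counter)
    cell_counts = Counter()
    for column, text in cells:
        token_counts[column].update(tokenize(text))
        cell_counts[column] += 1
    return token_counts, cell_counts


def predict(model, text):
    """Return the most probable column for `text` (Naive Bayes, add-one smoothing)."""
    token_counts, cell_counts = model
    vocab = {tok for counts in token_counts.values() for tok in counts}
    total_cells = sum(cell_counts.values())
    best_column, best_score = None, -math.inf
    for column in sorted(cell_counts):
        counts = token_counts[column]
        # The extra +1 reserves probability mass for tokens never seen in training.
        denom = sum(counts.values()) + len(vocab) + 1
        score = math.log(cell_counts[column] / total_cells)
        for tok in tokenize(text):
            score += math.log((counts[tok] + 1) / denom)
        if score > best_score:
            best_column, best_score = column, score
    return best_column


def find_wrong_column_cells(cells):
    """Flag cells where the classifier disagrees with the recorded column.

    Returns (index, recorded_column, suggested_column) triples; the suggested
    column is the proposed correction.
    """
    flagged = []
    for i, (column, text) in enumerate(cells):
        model = train(cells[:i] + cells[i + 1:])  # leave the cell itself out
        suggested = predict(model, text)
        if suggested != column:
            flagged.append((i, column, suggested))
    return flagged


# Hypothetical cells; the last one describes a habitat but sits in "remarks".
EXAMPLE_CELLS = [
    ("habitat", "found under stones near river"),
    ("habitat", "found in forest near river"),
    ("habitat", "under logs in forest"),
    ("remarks", "specimen damaged tail missing"),
    ("remarks", "specimen donated by museum"),
    ("remarks", "tail missing specimen preserved"),
    ("remarks", "found under logs near river"),
]

if __name__ == "__main__":
    print(find_wrong_column_cells(EXAMPLE_CELLS))
    # prints [(6, 'remarks', 'habitat')]
```

In this toy data only the last cell is flagged, and the classifier's suggestion ("habitat") doubles as the proposed correction. A real database would call for a held-out or cross-validated setup and a manual check of the flagged cells.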

Related resources

Spotting The 'Odd-One-Out': Data-Driven Error Detection And Correction In Textual Databases

We present two methods for semiautomatic detection and correction of errors in textual databases. The first method (horizontal correction) aims at correcting inconsistent values within a database record, while the second (vertical correction) focuses on values which were entered in the wrong column. Both methods are data-driven and language-independent. We utilise supervised machine learning, b...

Conjectural Correction of Some Difficult Phrases in Sharh-e Shathiyyat

This article reviews and corrects some difficult and obscure words in Description of Shathyyāt, written by Roozbehān Baqali. Like other mystical texts, the book uses a technical writing style, which makes it one of the more complicated mystical works. Some of its complexities, however, are assumed to originate in errors and inaccuracies in the text. A Compa...

Can Confidence Scores Post-editing Speech Recog

When dictating with speech recognition, most of the user’s time is spent correcting errors. To decrease the burden we propose new editor functions specifically to speed up the correction process. The idea is to use a recognition confidence measure to predict which words are likely to be in error, to display that information to the user by highlighting suspect words, and to provide a command to ...

Removing Geometric Distortion from Text Using Geometric Information of Text Lines

Document images produced by scanners or digital cameras usually have photometric and geometric distortions. If either of these effects distorts a document, recognition of words from the document image using OCR is subject to errors. In this paper we propose a novel approach to significantly remove geometric distortion from document images. In this method, we first extract document lines from do...

Frequency, Type and Causes of Medication Errors in Pediatric Wards of Hospitals in Yazd, the Central of Iran

Background: Medication errors are among the most common medical errors and are used as an indicator to assess patients' safety in hospitals. The aim of this study was therefore to investigate the frequency, type and causes of medication errors in children's wards at hospitals in Yazd, Iran. Materials and Methods: This descriptive-analytical study was conducted during 6 months from Jan to Jun 2015....

Publication date: 2006